Adaptive evaluation of text search queries with blackbox scoring functions

ABSTRACT

Disclosed is an evaluation technique for text search with black-box scoring functions, where it is unnecessary for the evaluation engine to maintain details of the scoring function. Included is a description of a system for dealing with blackbox searching, proofs of correctness, as well as experimental evidence showing that the performance of the technique is comparable in efficiency to those techniques used in custom-built engines.

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims priority under 35 U.S.C. §119(e) to co-pending U.S. Provisional patent application 60/474,877 filed on May 30, 2003.

TECHNICAL FIELD OF THE INVENTION

[0002] This invention relates generally to methods, apparatus and computer programs for execution of text search queries over a large body of data.

BACKGROUND OF THE INVENTION

[0003] Searching a body of documents for specific information is increasingly important in a growing number of systems. This introduction examines searching and ranking techniques for responding to a specific search request. Typically, providing the best available information calls for scoring information in each document and then ranking the information or the documents according to relevance. A variety of techniques are used to complete such tasks.

[0004] Custom built solutions typically offer acceptable performance for searching large bodies of data. Examples include those available for searching the Internet. Some of the commercially available solutions include biases that either speed the search or qualify data.

[0005] Experience has shown that it is typically a substantial challenge to meet performance expectations within the confines of certain systems. Examples of systems where challenges arise include general purpose database management systems and content management systems. For developers of searching algorithms, challenges to meeting expectations in such systems include balancing the concepts of ranking and approximation, as well as providing for a generality of purpose. These, and other concepts, are discussed in more detail to provide some perspective.

[0006] Ranking and approximation specify what to return when there are too many or too few results. One may consider these concepts to be at different ends of a single continuum. In order to provide desired searching capabilities, it is considered preferable that typical database systems should incorporate ranking into the generic and extensible architecture of the database engine. Typical database systems do not integrate the concepts of ranking and approximation. New and different ranking criteria and ranking functions should be easily incorporated into a query processing runtime. Preferably, database systems should not use a biased ranking method.

[0007] For more perspective, consider the following aspects of ranking text in databases. Note that information retrieval (IR) literature contains many ranking heuristics. A few of these heuristics, to which later reference will be made, include the Term Frequency Inverted Document Frequency (TFIDF) function, Static Rank functions, Searching by Numbers, Lexical Affinities (LA) and Salience Levels (SL), as well as other functions.

[0008] One common method for ranking text is by use of the TFIDF score. This is calculated as: $\mathrm{TFIDF}(q,d) = \sum_{t \in q} \frac{\phi_{t,d}}{\Gamma_{t}}$

[0009] Here, q represents the query, φ_(t,d) is the number of times term t occurs in document d, divided by the total number of terms in the document d. Γ_(t) is the number of documents which contain term t. A discussion of TFIDF is provided in the reference “Managing Gigabytes,” I. H. Witten, A. Moffat, and T. C. Bell. Morgan Kaufmann, San Francisco, 1999.
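The formula above can be illustrated with a minimal Python sketch. The function name tfidf, the toy corpus, and the choice to skip terms that occur in no document are assumptions made only for this illustration; they are not part of the disclosure.

```python
from collections import Counter


def tfidf(query_terms, doc_tokens, corpus):
    """TFIDF(q, d): sum over query terms t of phi_{t,d} / Gamma_t."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 0.0
    for t in query_terms:
        # Gamma_t: number of documents in the corpus that contain t.
        gamma = sum(1 for d in corpus if t in d)
        if gamma == 0 or total == 0:
            continue  # assumption: a term absent from the corpus adds nothing
        # phi_{t,d}: occurrences of t in d divided by the length of d.
        phi = counts[t] / total
        score += phi / gamma
    return score


# Hypothetical three-document corpus, each document a list of tokens.
corpus = [
    ["math", "john", "grisham"],
    ["math", "john", "english"],
    ["english", "adam"],
]
```

With this corpus, the query term "math" occurs once among the three tokens of the first document and appears in two documents, so its contribution is (1/3)/2.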

[0010] One example of a search engine that uses the Static Rank is the GOOGLE search engine, which uses PageRank. Typically, Static Ranks are used in combination with query dependent ranks such as TFIDF. As an example, scoring where the combination is used can be accomplished using a metric such as:

COMBIN(q, d)=αSTATIC(d)+TFIDF(q, d)

[0011] The combination presumes that some documents are generally better than others, and therefore, should be favored during retrieval. A discussion of Static Ranks is presented in the publication by S. Brin and L. Page, entitled “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” as published in Proceedings of the 7th International World Wide Web Conference (WWW7), 1998. This publication also discusses PageRank.

[0012] As an example of keyword based querying of structured datasets, consider the following example of Searching by Numbers. In the example, a user enters a few numbers into a search bar, for instance, “1 GHz, 256M” and the search system translates the query automatically to something like:

(processorSpeed≈1 GHz)

.and. (memoryCapacity≈256 MB)

[0013] In addition to automated translation to a structured form, results are ranked based on a discrepancy between the requested value and actual value returned for the two parameters of interest. A discussion of Searching by Numbers is presented in the publication by R. Agrawal and R. Srikant, entitled “Searching with Numbers,” published in the Proceedings of the 2002 International World Wide Web Conference (WWW2002), Honolulu, Hi., May 2002.
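As a rough illustration of ranking by discrepancy, the following Python sketch orders rows by how far their values fall from the requested ones. The relative-difference measure, the function names, and the sample rows are assumptions for illustration; the cited publication may define the discrepancy differently.

```python
def discrepancy(requested, row):
    """Sum of relative differences between requested and actual values."""
    return sum(abs(row[field] - want) / want for field, want in requested.items())


def rank_by_numbers(requested, rows):
    """Return rows ordered from smallest to largest discrepancy."""
    return sorted(rows, key=lambda row: discrepancy(requested, row))


# Hypothetical structured query derived from "1 GHz, 256M".
requested = {"processorSpeed": 1.0, "memoryCapacity": 256}

# Hypothetical product rows (speed in GHz, memory in MB).
rows = [
    {"name": "a", "processorSpeed": 0.8, "memoryCapacity": 256},
    {"name": "b", "processorSpeed": 1.0, "memoryCapacity": 256},
    {"name": "c", "processorSpeed": 1.0, "memoryCapacity": 512},
]

ranked_names = [row["name"] for row in rank_by_numbers(requested, rows)]
```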

[0014] Lexical Affinities and Salience Levels are described as score boosting heuristics. In the case of Lexical Affinities (LA), a score is boosted when two terms in the query appear within a small window of each other. In the case of Salience Levels (SL), the score is boosted when a query term appears with increased prominence such as in the title, a paragraph heading, or with bold and/or italicized text. Score boosting methods such as the use of LA and SL are commonly used in modern information retrieval systems. A discussion of Lexical Affinities and Salience Levels is provided in the publication by Y. Maarek and F. Smadja, and entitled “Full text indexing based on lexical relations: An application: Software libraries,” appearing in the Proceedings of the Twelfth International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-206, Cambridge, Mass., June 1989. Further examples are provided in the publication by E. M. Voorhees and D. K. Harman, and entitled “Overview of the Tenth Text Retrieval Conference (TREC-10),” appearing in the Proceedings of the Tenth Text Retrieval Conference (TREC-10), National Institute of Standards and Technology, 2001.

[0015] Another popular scoring function, referred to as OKAPI, is discussed in the publication by S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, entitled “Okapi at TREC-3,” appearing in Proceedings of the Third Text REtrieval Conference (TREC-3), pages 109-126. National Institute of Standards and Technology. NIST, 1994.

[0016] Presently, there is a debate by those skilled in the art over the choice of “term at a time” (TAAT) search strategies versus “document at a time” (DAAT) search strategies. One example of the various perspectives on these strategies is provided in the publication by H. Turtle and J. Flood, entitled “Query evaluation: Strategies and optimizations,” appearing in Information Processing and Management, 31(6):831-850, 1995.

[0017] Typically, a TAAT search engine maintains a sparse vector spanning the documents. The TAAT search engine iterates through the query terms and updates the vector with each iteration. The final state of the vector becomes the score for each document. TAAT search engines are relatively easy to program and new ranking functions are easily included in TAAT runtimes. Conversely, a DAAT search engine makes use of document indices. Typically, a DAAT runtime search engine iterates through documents subject to the search and scores a document before proceeding to the next one. A heap maintains the current top l documents identified.
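The contrast between the two strategies can be sketched in Python as follows. The toy index, the additive per-term weights, and the function names taat and daat are hypothetical and serve only to show the difference in iteration order.

```python
import heapq
from collections import defaultdict

# Hypothetical toy index: term -> {document id: score contribution}.
INDEX = {
    "math": {2: 0.5, 3: 0.4},
    "john": {2: 0.3, 3: 0.3, 4: 0.1},
}


def taat(query, l):
    """Term at a time: update a sparse score vector once per query term."""
    acc = defaultdict(float)
    for t in query:
        for d, w in INDEX.get(t, {}).items():
            acc[d] += w
    return heapq.nlargest(l, ((s, d) for d, s in acc.items()))


def daat(query, l):
    """Document at a time: fully score a document before moving to the next."""
    docs = sorted(set().union(*(INDEX.get(t, {}) for t in query)))
    heap = []  # min-heap holding the current top l (score, document id) pairs
    for d in docs:
        s = sum(INDEX.get(t, {}).get(d, 0.0) for t in query)
        if len(heap) < l:
            heapq.heappush(heap, (s, d))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, d))
    return sorted(heap, reverse=True)


query = ["math", "john"]
```

Both functions identify the same top documents here; they differ in whether the outer loop runs over terms or over documents.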

[0018] In the context of large data sets, it is considered by some that the index based DAAT runtime search engine outperforms the vector based TAAT search engine during query execution. However, DAAT runtimes are hard to implement. For example, each DAAT ranking engine is typically built as a custom system, rather than being implemented on top of a general purpose platform such as a database system. Typically, this is due to the fact that commercial database indices have little or no support for the ranking heuristics used in the text.

[0019] To address this issue, DAAT engines have typically been built using a two layer architecture. The user's query would first be translated into a Boolean query. A lower stage performs retrieval based on the Boolean query (or near Boolean query, that is a Boolean query with a “near” operator) which is then passed to a ranking stage for a complete evaluation. Thus, the Boolean stage acts as a filter which eliminates documents which have little or no relevance to the query or are otherwise unlikely to be in the result set.

[0020] From a runtime optimization perspective, two layer DAAT architecture has two potential problems. First, there is the need for a middle layer which translates a query into the Boolean form. This can be a complicated process. For example, translating to a Boolean “AND” of all the query terms may not return a potential hit, while translating to a Boolean “OR” may be an ineffective filter. Thus, depending on how effective the Boolean filters are, the DAAT search may end up performing a significant amount of extra input and output operations. Second, effective translations can lead to complicated intermediate Boolean queries. Consequently, the filters associated with even simple scoring functions such as TFIDF or COMBIN can present daunting optimization problems.

[0021] For further reference, the merge operator and the zig-zag join operators are described in the publication by Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina, entitled “Building a Distributed Full-Text Index for the Web,” appearing in ACM Transactions on Information Systems, 19(3):217-241, July 2001; and the publication by H. Garcia-Molina, J. Ullman, and J. Widom, entitled “Database System Implementation,” published by Prentice-Hall, 2000.

[0022] From a functional perspective, a scoring function uses more information than the Boolean filters. For instance, TFIDF requires determination of the quantity φ_(t,d), which requires more resources than determining if a term is present in a document. Per document scores (such as STATIC) and per term statistics (such as Γ_(t)) are used in scoring. Scoring in Searching by Numbers requires use of the numerical values in addition to indices. Heuristics such as LA and SL require information about where in the document and in what context any term occurred. In short, a typical information retrieval engine may use many heuristics and combine them in complicated ways. Examples are provided in two further publications. Consider the publication by David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S. Maarek, and Aya Soffer, entitled “Static index pruning for information retrieval systems,” appearing in the Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-50, New Orleans, LA., September 2001; and the publication by R. Fagin, R. Kumar and D. Sivakumar, entitled “Top k Orderings and Near Metrics,” to appear, 2001. Such schemes often require data to be provided to support the ranking decisions.

[0023] In a typical modern day search engine, the scoring function and the runtime engine are co-designed. This prior art arrangement is depicted in FIG. 1.

[0024] Referring to FIG. 1, a prior art search engine 5 receives a query 8 and provides results 9. This process calls for use of parser 3, which builds a token table 4 from a base table 2. Typically, the base table 2 is a collection of documents 7. The token table 4 includes tokens 13, which may be considered as segments or compilations of relevant information from a document 7. From the token table 4, an index table 6 is built. Using the input query 8, the prior art search engine 5 typically includes an embedded scoring function to provide scoring and ranking of information in the index table 6.

[0025] While this arrangement usually means that runtime optimization performs well, such designs come at the cost of versatility. For example, some text search engines have been customized to the extent of including special purpose hardware to speed up critical operators. This has made sense for certain applications, especially in the case where there are few scoring functions which are of concern. In such instances, there is no need for versatility as understanding the scoring function allows for better scheduling of the runtime operators. However, such engines are typically not very useful in contexts other than those for which they were developed.

[0026] It is important that a generic search engine provide a generic interface to support the varied search functions. In particular, the scoring function used to rank results should be “plug and play.” That is, what is needed is a runtime search engine for text search and ranking where the scoring function is queried as a “black box.”

SUMMARY OF THE INVENTION

[0027] The foregoing and other problems are overcome by methods and apparatus in accordance with embodiments of this invention.

[0028] Disclosed herein is a computer program product embodied on a computer readable medium, the computer program product providing computer instructions that implement a text and semi-structured search algorithm having a function having an input for receiving, while there is at least one candidate location in an order of locations, a score range for the candidate location, the algorithm comparing the score range to a threshold within a range of possible scores, wherein if a lower bound of the score range for the candidate location exceeds the threshold then the candidate location is retained as a result and a next location is selected, and wherein if an upper bound of the score range is at or below the threshold the candidate location is discarded and the next location is selected, and wherein if the score of the candidate location is indeterminate, then the score range for the candidate location is refined.

[0029] Also disclosed is a system for implementing a text and semi-structured search algorithm, that includes a processor for operating an algorithm that has an input for receiving from a blackbox scoring function a score for at least one candidate location in an order of locations, wherein the algorithm compares the score to a threshold, and if the score exceeds the threshold then the candidate location is stored as a result and a next location is selected, and if the score is at or below the threshold the candidate location is discarded and the next location is selected, and wherein if the score of the candidate location is indeterminate, then the score for the candidate location is refined; wherein each result is stored in a table of results ordered by relevance.

[0030] Further disclosed is a method for implementing a search of locations in a body of text and semi-structured data for relevant terms, which includes: providing an index of locations formed of terms, wherein a score for the relevant terms in a candidate location is provided by a scoring function and associated with the candidate location; and, while there are candidate locations: refining the score range if the score of the candidate location is indeterminate, otherwise, storing each candidate location as a result if a lower bound of the score range for the candidate location exceeds a threshold within a range of possible scores, discarding the candidate location if an upper bound of the score range is at or below the threshold, and selecting a next location.

BRIEF DESCRIPTION OF THE DRAWINGS

[0031] The above set forth and other features of the invention are made more apparent in the ensuing Detailed Description of the Invention when read in conjunction with the attached Drawings, wherein:

[0032] FIG. 1 depicts a prior art embodiment of a search engine;

[0033] FIG. 2 depicts a search engine in accordance with the teachings herein;

[0034] FIG. 3 depicts progress of a query through a series of documents d_(x);

[0035] FIG. 4 depicts the processing expense for running a blackbox learning algorithm;

[0036] FIG. 5 depicts the processing expense as a fraction of total processing requirements;

[0037] FIG. 6 depicts performance of the Algorithm A for a complex query;

[0038] FIG. 7 depicts aspects of performance as a result of adjusting the threshold;

[0039] FIGS. 8-9 depict performance as a result of “unsafe” approximations;

[0040] FIG. 10 depicts performance where scoring functions are combined; and,

[0041] FIG. 11 depicts aspects of components for execution of Algorithm A.

DETAILED DESCRIPTION OF THE INVENTION

[0042] Disclosed herein is a generic search engine that provides a generic interface to support a variety of search functions. In particular, scoring functions 12 used to rank results are queried as a “black box,” where an evaluation engine does not require information regarding aspects of the scoring functions 12. An example is depicted in FIG. 2.

[0043] Referring to FIG. 2, searching is completed by use of a runtime engine 11 that receives an input query 8. Using the input query 8, the runtime engine 11 employs an algorithm, disclosed herein in non-limiting embodiments as “Algorithm A,” or as “A”, to generate results 9. Algorithm A provides an interface to any one or more of various preselected scoring functions 12 to obtain scoring and ranking information used to provide the results 9. Non-limiting examples of scoring functions 12 useful for practice of the teachings herein include: TFIDF, OKAPI, Static Rank, Lexical Affinities, Salience Levels, in addition to Boolean functions and threshold predicates. As one may surmise, the techniques disclosed herein generally provide for scoring and ranking at the level where indexing occurs.

[0044] Blackbox Scoring. A first device, the parser 3, produces data for interpretation by the scoring function 12. Thus, parsing and scoring are intimately related. Therefore, a blackbox model for scoring, as depicted in FIG. 1, specifies a corresponding model for parsing and tokenization. In the context of a database, the flow may be triggered by a statement, such as a call in the form of the SQL statement below:

[0045] CREATE TEXT INDEX <indexname>

[0046] ON TABLE <tablename>

[0047] USING <parsername 1> FOR <column 1>

[0048] USING <parsername 2> FOR <column 2>

[0049] RANKED BY SCORE <scoringfunction>

[0050] First Steps: As an introduction, it is clear that blackbox scoring for a top-l query can be performed via a table scan of the base table 2. This is shown in Table 1. Note that the algorithm in Table 1 does not address aspects of the functionality of parsers 3 and scoring functions 12, and is therefore only illustrative.

TABLE 1 Table Scan (prior art)
1 initialize heap
2 while (docs left)
2.1   score the next document
2.2   if it is in top l, put in heap
3 return the heap

[0051] Since the score is zero unless the document 7 contains some token in the query 8, only those documents 7 which contain at least one token 13 related to the query 8 need be scored. This provides a basis for speeding the runtime engine 11 by use of the index table 6. Another example is shown in Table 2.

TABLE 2 Faster Scan (prior art)
1 initialize heap
2 while (docs containing a query token left)
2.1   score the next such document
2.2   if it is in top l, put in heap
3 return the heap
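The scans of Tables 1 and 2 can be sketched in Python as follows. The document texts, the inverted index layout, and the term-counting scoring function are hypothetical stand-ins; any function that evaluates to zero for documents without query tokens would serve.

```python
import heapq


def table_scan(docs, score, l):
    """Table 1: score every document, keeping the top l in a min-heap."""
    heap = []
    for d, text in docs.items():
        s = score(text)
        if len(heap) < l:
            heapq.heappush(heap, (s, d))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, d))
    return sorted(heap, reverse=True)


def faster_scan(docs, index, query, score, l):
    """Table 2: score only documents that contain at least one query token."""
    hits = sorted(set().union(*(index.get(t, ()) for t in query)))
    return table_scan({d: docs[d] for d in hits}, score, l)


# Hypothetical documents, query, inverted index, and scoring function.
docs = {
    1: "english adam smith",
    2: "math john grisham",
    3: "math john english",
    4: "english johnny davis",
}
query = ["john", "math"]
index = {t: {d for d, text in docs.items() if t in text.split()} for t in query}


def score(text):
    return sum(text.split().count(t) for t in query)


full = table_scan(docs, score, 2)
filtered = faster_scan(docs, index, query, score, 2)
```

In this example the faster scan scores two documents instead of four and returns the same top two.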

[0052] Note that the algorithm in Table 2 is not optimal. For example, some scoring functions 12 may evaluate to zero, even when some terms in the query 8 are present. One example is the Boolean AND function. Thus, the filtering provided in step 2 leaves room for improvement.

[0053] As disclosed herein, a scoring function 12 which allows input of generic parameters (i.e., “wildcards”) can be used as a black box within an efficient and generic runtime 11. Such a scoring function 12 can be used to do a partial evaluation of the score without having to collect all relevant parameters associated with the document 7. This can present significant benefits, as in the case where the relevant parameters are scattered in storage (e.g., over a disk in a text index).

[0054] The algorithm disclosed herein is one (non-limiting) embodiment of a generic algorithm “A.” One embodiment of Algorithm A is presented in Table 3, below. Algorithm A is described herein in terms of two non-limiting subroutines, nextCand( ) and refine( ). In some embodiments, Algorithm A iterates through documents 7 using the function nextCand( ). Algorithm A uses partial score evaluations to avoid retrieving document parameter values from storage. As used herein, the terms “lower” and “upper” represent the lower bound and the upper bound on the range of possible scores for “candidate.” As Algorithm A proceeds, it takes one of three options. If the current candidate is in the top l found so far (step 2.2), Algorithm A adds it to the heap and continues to the next candidate. If the candidate is not in the top l (step 2.3), Algorithm A goes on to the next candidate. If the status of candidate cannot be determined (step 2.4), then Algorithm A tries to refine( ) the score. As a side effect, both refine( ) and any change of candidate alter the values of lower and upper.

TABLE 3 Algorithm A
1 initialize heap and set candidate = 0
2 while (candidate exists)
2.1   let threshold be the smallest score in heap
2.2   if (threshold < lower)
        add candidate to the heap
        candidate = nextCand( )
2.3   else if (threshold ≥ upper)
        candidate = nextCand( )
2.4   else
        refine( )
3 return the heap
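The control flow of Table 3 might be sketched in Python as shown below. Several details are assumptions made for this sketch and are not stated in the text: the subroutines are passed in as callables, scores are taken to be non-negative so that 0 can serve as the threshold until l results are collected, and the heap records the lower bound at the time a candidate is admitted. The toy subroutines use trivial bounds and a full-evaluation refine.

```python
import heapq


def algorithm_a(l, next_cand, bounds, refine):
    """Top-l loop of Table 3 with caller-supplied subroutines (assumed shapes):
    next_cand(candidate, threshold) -> next candidate id, or None when done
    bounds(candidate) -> (lower, upper) for the candidate's score
    refine(candidate) -> tightens the bounds as a side effect
    """
    heap = []  # min-heap of (score, document id), at most l entries

    def current_threshold():
        # Assumption: with fewer than l results, any positive score qualifies.
        return heap[0][0] if len(heap) == l else 0.0

    candidate = next_cand(0, 0.0)
    while candidate is not None:
        threshold = current_threshold()
        lower, upper = bounds(candidate)
        if threshold < lower:  # step 2.2: candidate is in the top l
            if len(heap) == l:
                heapq.heapreplace(heap, (lower, candidate))
            else:
                heapq.heappush(heap, (lower, candidate))
            candidate = next_cand(candidate, current_threshold())
        elif threshold >= upper:  # step 2.3: candidate cannot qualify
            candidate = next_cand(candidate, threshold)
        else:  # step 2.4: status is indeterminate
            refine(candidate)
    return sorted(heap, reverse=True)


# Hypothetical exact scores; a score is unknown until refine() "reads" it.
SCORES = {1: 0.0, 2: 0.7, 3: 0.9, 4: 0.2}
known = {}


def next_cand(candidate, threshold):
    return candidate + 1 if candidate + 1 in SCORES else None


def bounds(candidate):
    if candidate in known:
        return known[candidate], known[candidate]
    return 0.0, 1.0  # nothing known yet


def refine(candidate):
    known[candidate] = SCORES[candidate]  # full evaluation of the score


top2 = algorithm_a(2, next_cand, bounds, refine)
```

The loop terminates only if refine( ) eventually narrows the range enough for one of the first two branches to apply, which the full evaluation used here guarantees.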

[0055] Note that both algorithms in Tables 1 and 2 are specializations of Algorithm A. In the case of the algorithm in Table 1, nextCand( ) returns candidate+1. In the case of the algorithm in Table 2, nextCand( ) returns the next document 7 which contains at least one of the terms in the query 8. In both cases, refine( ) does a full evaluation of the score, preferably by reading all the parameters from disk. Also, note that Algorithm A can be modified to work in “streaming” mode. In this case, there will be no heap and the threshold will be provided by a caller. Refer to Table 4 for a non-limiting example of Algorithm A modified to work in streaming mode. Note that in the Algorithm A provided in Table 4, the threshold can be increased in each call to next( ).

TABLE 4 Algorithm A modified for streaming mode
init( ) {
  set candidate = 0
}
next(threshold) {
  candidate = nextCand( )
  while( candidate exists ) {
    if( threshold < lower )
      return candidate
    else if ( threshold >= upper )
      candidate = nextCand( )
    else
      refine( )
  }
  return end of results
}
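A minimal Python sketch of the streaming variant in Table 4 follows. The class name StreamingA, the callable-based subroutines, the done flag that guards calls made after the end of results, and the toy scores are all assumptions for illustration.

```python
class StreamingA:
    """Streaming form of Algorithm A: no heap, the caller supplies the threshold."""

    def __init__(self, next_cand, bounds, refine):
        self.next_cand = next_cand
        self.bounds = bounds
        self.refine = refine
        self.candidate = 0
        self.done = False

    def next(self, threshold):
        if self.done:
            return None  # end of results was already reported
        self.candidate = self.next_cand(self.candidate)
        while self.candidate is not None:
            lower, upper = self.bounds(self.candidate)
            if threshold < lower:
                return self.candidate
            elif threshold >= upper:
                self.candidate = self.next_cand(self.candidate)
            else:
                self.refine(self.candidate)
        self.done = True
        return None


# Hypothetical exact scores; bounds are (0, 1) until refined.
SCORES = {1: 0.0, 2: 0.7, 3: 0.9, 4: 0.2}
known = {}


def next_cand(candidate):
    return candidate + 1 if candidate + 1 in SCORES else None


def bounds(candidate):
    if candidate in known:
        return known[candidate], known[candidate]
    return 0.0, 1.0


def refine(candidate):
    known[candidate] = SCORES[candidate]


stream = StreamingA(next_cand, bounds, refine)
# The caller raises the threshold between calls, as the text permits.
results = [stream.next(0.5), stream.next(0.8), stream.next(0.8)]
```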

[0056] Efficient design of nextCand( ) and refine( ) is important to providing desired performance in the operation of Algorithm A. To this end, some additional introduction is provided regarding parsing, scoring functions 12, text indexes 6, and how text indexes 6 incur I/O penalties. Subsequently, aspects of implementing both nextCand( ) and refine( ) for a blackbox scoring function 12 that supports wildcarding are provided.

[0057] The Parser 3. A parser 3 effects the transformation from the base table 2 to the token table 4 shown in FIG. 2 and detailed in Tables 5 and 6. The token table 4 need not be stored, but can be streamed directly into the index build phase. The token table 4 represents a ternary relation, (d, t, θ_(t,d)). Here, d is the “document” 7 or the row identifier (RID) where the token t was found. The symbol θ_(t,d) represents information about token t within document d needed by the scoring function 12 in determining the score of the document 7. This could, for instance, include information about the salience of the token 13, its location(s) within the document 7, the number of times the token 13 occurred in the document 7, as well as other information. The generic search system considers θ_(t,d) opaque binary data and does not attempt to interpret it. However, it is responsible for storing and retrieving θ_(t,d) to and from the index table 6.

TABLE 5
d   Subject   Student         Grade
1   English   Adam Smith      B
2   Math      John Grisham    C
3   Math      John English    A
4   English   Johnny Davis    A

[0058] TABLE 6
d       t         θ_(t,d)
1       English   Subject
1       Adam      Student
1       Smith     Student, Bold
2       Math      Subject
2       John      Student, Emphasis
2       Grisham   Student
. . .   . . .     . . .

[0059] In this example, Table 5 represents the base table 2 having two indexable columns, “Subject” and “Student.” Table 6 provides a first few rows of the token table 4 corresponding to the base table 2.

[0060] Assume, without loss of generality, that the pair (t, d) is a unique key for the token table 4 (otherwise, concatenate the set of associated θ values). Thus, the reference to θ_(t,d) is unambiguous. Consider that θ_(t,d) is null if the token table 4 contains no entry corresponding to the pair (t, d). Otherwise, assume that document d contains t. Thus, per the example in Table 6, θ_(John,1) is null and document 2 contains “Grisham” and “Math.”
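The transformation from base table to token table can be sketched in Python as follows. This is an illustrative parser only: it splits on whitespace and records the column name as θ, whereas Table 6 also carries salience markers such as Bold and Emphasis that this sketch does not model. The names parse and theta are hypothetical.

```python
# Base table of Table 5, restricted to its two indexable columns.
BASE_TABLE = {
    1: {"Subject": "English", "Student": "Adam Smith"},
    2: {"Subject": "Math", "Student": "John Grisham"},
    3: {"Subject": "Math", "Student": "John English"},
    4: {"Subject": "English", "Student": "Johnny Davis"},
}


def parse(base_table, columns):
    """Build a token table keyed by (t, d); theta is a list of column names.

    If a token occurs more than once in a row, the associated theta values
    are concatenated so that (t, d) remains a unique key.
    """
    table = {}
    for d, row in base_table.items():
        for column in columns:
            for t in row[column].split():
                table.setdefault((t, d), []).append(column)
    return table


def theta(table, t, d):
    """Return theta_{t,d}, or None (null) if document d does not contain t."""
    return table.get((t, d))


token_table = parse(BASE_TABLE, ["Subject", "Student"])
```

The engine itself would treat each θ value as opaque; only the parser and the scoring function need to agree on its contents.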

[0061] Preferably, a user can create and register new parsers 3 for any column or data type in concert with creating and registering new scoring functions 12.

[0062] Scoring Functions 12. Consider scoring functions 12 having generic input capability (i.e., “wildcard” capability). In the following discussion, differences between SCORE, the intended scoring function 12 (e.g., TFIDF and OKAPI) and score, an implementation of SCORE which supports wildcarding are discussed. Specifically, associated with each query q involving tokens t₁, t₂, . . . t_(k), is a blackbox scoring function score(x₁, x₂, . . . , x_(k)). If partial evaluation is performed by setting some of the x_(i) to θ_(ti,d) and others to “huh” (a wildcard value), score returns a range (lower, upper) giving lower and upper bounds on the document score SCORE(d). Preferably, the score function exhibits the properties set forth in Table 7.

TABLE 7 score Function Properties
1 Monotonicity: lower does not decrease and upper does not increase if a huh parameter is changed to any definite null or non-null θ value.
2 If no parameter is set to huh, upper = lower = SCORE(d).
3 lower ≤ upper

[0063] Note that any correctly implemented score function does not impose restrictions on the scoring metric SCORE. Most, if not all, commonly used scoring functions 12 admit wildcard capable implementations that satisfy the properties in Table 7.
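Two wildcard-capable implementations are sketched below in Python to make the properties of Table 7 concrete. The sketch assumes that θ is simply the relative term frequency φ (a number between 0 and 1) and that the per-term document counts Γ are fixed in advance; the sentinel HUH and the names make_tfidf_score and and_score are hypothetical.

```python
HUH = object()  # wildcard: the parameter value has not been read yet


def make_tfidf_score(gammas):
    """Return a TFIDF-style score function over one parameter per query term."""

    def score(*xs):
        lower = upper = 0.0
        for x, gamma in zip(xs, gammas):
            if x is HUH:
                upper += 1.0 / gamma  # phi is at most 1, so this bounds the term
            elif x is not None:  # definite, non-null theta
                lower += x / gamma
                upper += x / gamma
            # a null parameter contributes nothing to either bound
        return lower, upper

    return score


def and_score(*xs):
    """Boolean AND: 1 if every term is present, otherwise 0."""
    if any(x is None for x in xs):
        return 0.0, 0.0  # one missing term already decides the outcome
    if any(x is HUH for x in xs):
        return 0.0, 1.0
    return 1.0, 1.0


score = make_tfidf_score([2, 4])
```

Replacing a HUH by a definite value can only raise lower or reduce upper in both functions, and with no HUH left the two bounds coincide.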

[0064] Text Indexes 6 and Skip Sequential Iterators. A Skip Sequential Iterator (SSI) is a convenient interface to a text index 6. The iterator I_(t) corresponds to a token t (i.e., term) and iterates over all documents d containing t. Table 8 contains a definition for one embodiment of an SSI.

TABLE 8 Skip Sequential Iterator (SSI) - Definition
The iterator I_(t) is associated with a state s, initially 0. Unless s is 0 or ∞, s is a document id containing t. In addition, the iterator I_(t) has the following interface:
1 loc( ) returns the current state s.
2 data( ) returns θ_(t,s).
3 next(a) sets s to the smallest b ≥ a such that b contains t. If there is no such document id, it sets s to ∞.

[0065] The Algorithm A maintains a collection of SSIs, {I_(t)}, one per token t in the query. Initially, each iterator I_(t) is at 0. The Algorithm A moves the iterators I_(t) by making calls to I_(t).next(candidate). This call is denoted herein as: toss(t).

[0066] Note that a side-effect of the toss(t) operation is that the data value, θ_(t, candidate) is known. If after a toss(t) call I_(t).loc( )=candidate, θ_(t, candidate) is known to be I_(t).data( ). Otherwise, θ_(t, candidate)=null. Algorithm A uses toss(t) calls to read parameter values in the refine( ) subroutine.
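The SSI interface and the toss(t) operation can be sketched in Python as follows. The in-memory posting dictionary, the use of float("inf") for the end state, and the binary search via bisect are assumptions of this sketch; a real text index would read postings from storage.

```python
import bisect

END = float("inf")  # state reached when no further document contains t


class SkipSequentialIterator:
    """Iterator over the documents containing one token t."""

    def __init__(self, postings):
        self.postings = postings  # document id -> theta_{t,d}
        self.ids = sorted(postings)
        self.s = 0  # initial state

    def loc(self):
        return self.s

    def data(self):
        return self.postings.get(self.s)

    def next(self, a):
        """Set the state to the smallest document id b >= a that contains t."""
        i = bisect.bisect_left(self.ids, a)
        self.s = self.ids[i] if i < len(self.ids) else END
        return self.s


def toss(iterator, candidate):
    """Move the iterator to candidate and return theta_{t,candidate} or None."""
    iterator.next(candidate)
    return iterator.data() if iterator.loc() == candidate else None


it = SkipSequentialIterator({2: "Student, Emphasis", 5: "Student"})
```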

[0067] The following assertion follows from the discussion regarding the steps taken by algorithm A.

[0068] Lemma: As long as candidate only increases, and the iterators I_(t) are only moved using toss(t) operations, θ_(t,d) is null whenever candidate ≤ d < I_(t).loc( ).

[0069] Defining Subroutines. Aspects of the subroutines nextCand( ) and refine( ) are now defined.

[0070] The nextCand( ) function. For convenience, changes are made in the notation used, where the tokens 13 are renamed so that I_(t1).loc( ) ≤ I_(t2).loc( ) ≤ . . . ≤ I_(tk).loc( ). Consider S(d) to be defined as:

$S(d) = \mathrm{score}(\mathit{huh}, \mathit{huh}, \ldots, \theta_{t_i,d}, \theta_{t_{i+1},d}, \ldots, \mathit{null}, \mathit{null}).\mathit{upper}$   (Eq. 1)

[0071] where t_(i), t_(i+1), . . . are the tokens t, whose iterators I_(t) are at I_(t).loc( )=d. For these tokens 13, θ_(t,d) is available to use without I/O, since θ_(t,d)=I_(t).data( ). Terms t, whose iterators I_(t) are at I_(t).loc( )<d, are parameterized by huh, and those whose iterators I_(t) are at I_(t).loc( )>d are parameterized by null.

[0072] As shown in FIG. 3, nextCand( ) returns the smallest d>candidate such that S(d)>threshold. As d increases from left to right, the x_(i) in Eq. 1 changes from null to a definite θ (at d=I_(t).loc( )), and then to huh at d=I_(t).loc( )+1. Therefore, nextCand( ) can be implemented via a linear search using at most 2k blackbox invocations of score. The computation of nextCand( ) typically does not involve I/O, as all the parameter values required for computing S(d) are available. Referring to FIG. 3, the nextCand( ) function returns the smallest document id d such that S(d) exceeds threshold. In this case, this is d₃+1.
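The linear search can be sketched in Python as follows. The sketch represents each iterator by its (location, data) pair, evaluates S(d) only at the points where a parameter can change, and uses a simple counting score as the blackbox; these representations and the names s_of_d, next_cand, and count_score are assumptions, and the sketch may use one evaluation more than the 2k bound stated above.

```python
HUH = object()  # wildcard parameter
END = float("inf")  # location of an exhausted iterator


def count_score(*xs):
    """Hypothetical blackbox: the score is the number of query terms present."""
    lower = sum(1 for x in xs if x is not HUH and x is not None)
    upper = lower + sum(1 for x in xs if x is HUH)
    return lower, upper


def s_of_d(d, iters, score):
    """Eq. 1: huh for iterators behind d, theta at d, null for those ahead."""
    params = []
    for loc, data in iters:
        if loc < d:
            params.append(HUH)
        elif loc == d:
            params.append(data)
        else:
            params.append(None)
    return score(*params)[1]  # upper bound


def next_cand(candidate, threshold, iters, score):
    """Smallest d > candidate with S(d) > threshold, or None if there is none."""
    # S(d) changes only at an iterator location or just after one.
    points = {candidate + 1}
    for loc, _ in iters:
        points.update((loc, loc + 1))
    for d in sorted(p for p in points if candidate < p < END):
        if s_of_d(d, iters, score) > threshold:
            return d
    return None


# Hypothetical iterator states: (location, theta at that location).
iters = [(3, "a"), (7, "b"), (9, "c")]
```

With a threshold of 1, documents before 7 can match at most one term, so the search skips directly to document 7 without any I/O.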

[0073] The Lemma implies that S(d) is an upper bound on SCORE(d). This is because all definite parameters (not huh) used in the evaluation of S(d) are in fact the correct parameters for document d (either θ_(t,d) or null). Therefore, all documents 7 skipped by nextCand( ) are not qualified to enter the heap. Finally, since candidate is only updated using a nextCand( ) call, the score for candidate does not decrease. Therefore, a corollary to the Lemma is proffered.

[0074] Corollary. As long as refine( ) is implemented using only toss(t) operations, Algorithm A will produce correct results 9.

[0075] The refine( ) function. Notice that a toss(t) operation simultaneously reads several θ values. Letting s be I_(t).loc( ) after the toss(t) operation, θ_(t,d) is known for any d ∈ [candidate, s]. In this case, all but θ_(t,s) are null. Therefore, refine( ) should operate to choose a term t to toss such that the status of candidate and as many succeeding document ids as possible are resolved by the toss. Clearly, refine( ) can only toss terms whose current locations are smaller than candidate. For all other terms t, θ_(t,candidate) is identified by refine( ). One can measure (or “learn”) the effectiveness of t dynamically by noting exactly how far candidate advanced following a toss(t) operation and attributing this progress to t. The amount attributed to t can be 0 if the status of candidate was not resolved, or a larger number if candidate was advanced by a lot. To this end, Equation 2 provides:

$\Gamma_{t} = \frac{\text{total progress attributed to } t}{\text{number of toss}(t) \text{ operations}}$   (Eq. 2)

[0076] Assuming that the values Γ_(t) have converged, the token t with the largest value of Γ_(t) is chosen for the toss. Tossing any token 13 such that I_(t).loc( ) ≥ candidate would be meaningless since the ordering invariant implies that the value of θ_(t, candidate) is known. Thus, the token 13 with the largest value of Γ_(t) among those for which I_(t).loc( )<candidate is tossed. Γ_(t) may be evaluated using other techniques such as geometric mean, moving averages, logarithmic scaling and others. In some embodiments, Γ_(t) is provided as an input having a known value.
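The bookkeeping behind Eq. 2 and the choice of the token to toss can be sketched in Python as follows. The class name TossStats, its methods, and the decision to treat a token that has never been tossed as maximally promising are assumptions of this sketch, not details given in the text.

```python
class TossStats:
    """Tracks, per token, the progress attributed to its toss operations."""

    def __init__(self, tokens):
        self.progress = {t: 0 for t in tokens}
        self.tosses = {t: 0 for t in tokens}

    def record(self, t, advanced):
        """Attribute to t the distance candidate advanced after toss(t)."""
        self.tosses[t] += 1
        self.progress[t] += advanced

    def gamma(self, t):
        """Eq. 2: average progress per toss of t."""
        if self.tosses[t] == 0:
            return float("inf")  # assumption: try untested tokens first
        return self.progress[t] / self.tosses[t]

    def choose(self, locs, candidate):
        """Pick the token with the largest gamma among iterators behind candidate."""
        eligible = [t for t, loc in locs.items() if loc < candidate]
        return max(eligible, key=self.gamma) if eligible else None


stats = TossStats(["a", "b", "c"])
stats.record("a", 0)   # toss did not resolve the candidate
stats.record("a", 10)  # candidate advanced by 10 document ids
stats.record("b", 2)
stats.record("c", 8)
```

Here the averages are 5 for a, 2 for b, and 8 for c, so c is preferred whenever its iterator is behind candidate, and a otherwise.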

[0077] Experimental Evidence Supporting Algorithm A. This section provides experimental evidence affirming the utility of Algorithm A. Consider that Algorithm A trades processing speed (CPU cost) for the benefit of the versatility offered. Therefore, the criterion for evaluating the utility of Algorithm A is whether the trade is warranted. In order to address this evaluation, three tests are proffered, with a subsequent evaluation of performance.

[0078] First, the additional computational cost (i.e., CPU cost) of dealing with blackbox scoring functions is considered. Second, the fraction of the total cost of query processing that the CPU cost represents is evaluated. Third, functions which are difficult to optimize are considered.

[0079] In short, it has been determined that CPU cost is moderate. In tests performed, the CPU cost was never worse than a factor of two, even when the scoring functions 12 were simplistic (i.e., Boolean AND and Boolean OR functions). These were considered to be the worst case, since the optimized non blackbox code was shown to perform well in these cases. It was found that the fraction of the total cost of query processing depends on the cost of input and output (I/O). Testing showed that CPU cost was a negligible fraction of the runtime, as the cost of the I/O operations increased. For instance, if the cost of a toss( ) was at least 0.001 millisecond, (which is considered to be an aggressive estimate by any standard), then the additional cost of the learning computation and using the blackbox was shown to be less than 3%. Since A does not depend on knowing the SCORE function, this overhead is likely to be small even for complicated SCORE functions. Last of all, considering functions where optimization presents challenges, it was found that for a simple four node, two level, tree of un-weighted threshold gates, much like what is commonly used in text processing, Algorithm A performs significantly better in terms of both CPU cost and I/O than the natural extension of a Zig-Zag search or a Merge search. Notice that Zig-Zag and Merge are “locally optimal” for each of the nodes in the tree, and do not share the global perspective on optimization of Algorithm A.

[0080] Performance testing was undertaken by implementing Algorithm A on two platforms. The first (P1) provided an artificial test platform for testing the algorithm function. The second (P2) used a full text index with blackbox scoring functions 12, such as TFIDF, OKAPI, Static Rank, Lexical Affinities, in addition to Boolean functions and threshold predicates.

[0081] An index containing 8 GB of index data and over 1M documents was built on the platform. Both algorithms were run on a personal computer (PC) in a Linux environment, with a 2 GHz CPU. The experiments were performed with a cold I/O subsystem. In the P1 system, the tokens 13 included integers and document d contained all tokens 13 which exactly divided d. Thus, the document “10” contains the words “1,” “2,” “5,” “10.” Document “11” contains the words “1,” “11.” The query “3” and “5” should return all multiples of “15.” The reason that this platform was considered useful was that no I/O is required in implementing the SSIs. The documents d containing a token t are all multiples of t, and so the next( ) function could be implemented “on the fly.” This provided for separating the CPU cost of running A from the cost of the I/O and thus the separate measurement of each. Experiments in this section were implemented on platform P1.
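The synthetic platform P1 can be illustrated with a short sketch. The names `tokens_of`, `MultiplesIterator` and `and_query` are hypothetical, and the `next(target)` signature (advance to the first location at or after `target`) is an assumption made for the example; the simple leapfrogging intersection shown is only a stand-in for query evaluation and is not Algorithm A.

```python
def tokens_of(d):
    """Tokens of synthetic document d: every positive integer that exactly divides d."""
    return [t for t in range(1, d + 1) if d % t == 0]


class MultiplesIterator:
    """Posting list for integer token t, computed on the fly with no I/O."""

    def __init__(self, t):
        self.t = t
        self.loc = 0  # positioned before the first document

    def next(self, target):
        # Smallest multiple of t that is >= target (ceiling division).
        self.loc = -(-target // self.t) * self.t
        return self.loc


def and_query(tokens, limit):
    """Documents up to `limit` that contain every token, found by leapfrogging."""
    iterators = [MultiplesIterator(t) for t in tokens]
    results, candidate = [], 1
    while candidate <= limit:
        locs = [it.next(candidate) for it in iterators]
        if min(locs) == max(locs):      # all iterators agree on one document
            if locs[0] <= limit:
                results.append(locs[0])
            candidate = locs[0] + 1
        else:
            candidate = max(locs)       # skip ahead to the furthest iterator
    return results
```

With these definitions, `tokens_of(10)` yields `[1, 2, 5, 10]` and `and_query([3, 5], 50)` yields `[15, 30, 45]`, matching the multiples-of-15 behavior described above.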

[0082] Aspects of CPU cost are depicted in FIG. 4. FIG. 4 shows that the CPU cost (but for a tiny startup cost) of all algorithms is linear in the number of documents processed. This FIG. establishes that the total CPU cost of running a blackbox learning algorithm is rather small, less than half a second on a base table 2 containing 100 M documents.

[0083] FIG. 4 shows that the CPU cost of query processing algorithms scales linearly with the number of documents to be processed. The figure depicts results for four scoring functions 12, and the CPU cost in msec on a 2 GHz CPU. The X-axis depicts the number of documents 7. The curves depicted represent Algorithm A running Boolean AND and Boolean OR (labeled as and-learn and as or-learn) as well as the optimal Zig-Zag and Merge algorithms for AND and OR respectively. As can be seen, even with 10^9 (1 billion) documents, the total CPU time taken in the worst case is rather small, about 5 seconds.

[0084] FIG. 5 depicts aspects of CPU cost as a fraction of total cost. As discussed above, blackbox function evaluations and all the learning needed in picking the correct parameterization for blackbox score functions takes very little CPU usage. FIG. 5 provides an assessment of how this number compares with the total cost of running the query 8.

[0085] In FIG. 5, the overhead CPU cost of the learning Algorithm A is negligible. The figure shows two curves, corresponding to the Boolean AND and the Boolean OR function respectively. The number shows the ratio of clock time taken as a function of I/O cost. The I/O cost was modeled by implementing a spin loop to model a disk access. The X-axis shows the time expended in the spin loop per I/O operation. As can be seen, if the I/O operation were to cost 0.001 millisecond, Algorithm A is 3% worse than the optimal Zig-Zag on AND queries and only 1% worse than Merge on OR queries. Recall that Algorithm A does not know the difference between AND and OR, but self tunes to the correct execution strategy in both cases.

[0086] In FIG. 6, performance for a complex query is depicted. On a two level threshold tree with four gates using a set of strongly correlated and strongly anticorrelated terms (a typical optimization nightmare), Algorithm A performed about 100 times better on both CPU and I/O measures when compared to an algorithm which chooses the optimum join method for each node in the tree. Thus, the “global” perspective that Algorithm A takes on optimization results in far more efficient runtime performance.

[0087] About Algorithm A: Generalizing Zig-Zag and Merge. If the SCORE function is a k-way Boolean AND, the optimal runtime strategy is the Zig-Zag join algorithm. In the case of Boolean SCORE functions, the θ values are not relevant. Thus, the interest is only in whether a term t is contained in document d. In the case of Boolean AND, it is easy to see that candidate should be the maximum I_(t).loc( ) value. Also, the token t in toss(t) is chosen to be the rarest token whose location is not candidate. Hardcoding these choices results in the Zig-Zag join algorithm. A will automatically mimic this tuned strategy. That is, since score will return a 0 upper and lower bound if even one parameter is set to null, candidate will always be the maximum I_(t).loc( ). S(d) will be 0 for all documents 7 having lower scores. Moreover, Γ_(t) will converge to be large for rare tokens 13 and small for common tokens 13. Algorithm A also generalizes the merge operator for Boolean OR queries. Note that Algorithm A will converge to the optimum behavior in both cases while only making blackbox calls to score.

[0088] Minterms: As stated above, refine( ) is free to toss any input I_(t) such that I_(t).loc( )<candidate. However, tossing some input I_(t) is fruitless when knowledge of θ_(t,candidate) will not affect the decision to keep or discard the candidate. For example, consider the Boolean query (A & B)|(C & D). When I_(C).loc( )<I_(A).loc( )<I_(B).loc( )<I_(D).loc( ), then candidate=I_(B).loc( ). Knowledge of whether C contains candidate is irrelevant; the next location where we are interested in C is at I_(D).loc( ). Therefore, C should not be tossed until candidate=I_(D).loc( ). We say that an input I_(t) is part of a minterm in the current state when its θ_(t,candidate) can affect the decision to keep or discard the candidate when combined with other inputs.

[0089] An input that is part of a minterm can be found efficiently using monotonicity of the score upper bound when a parameter is changed from huh to null: Order the inputs with I_(t).loc( )<candidate arbitrarily. Recall that these inputs pass huh parameters to the score function. We will use the known θ_(t,candidate) (null or non-null) for inputs with I_(t).loc( )≧candidate. Notice that with all these θ_(t,candidate) and huh values, lower≦threshold<upper because we are in refine( ). We can find the first input from our ordering that is part of some minterm in the following way: one-by-one change a huh to null and reevaluate the score bounds. By monotonicity, upper will decrease or remain constant and lower will increase or remain constant. The first input to make upper≦threshold or threshold<lower is part of some minterm. Such an input will always be found because if the last huh value is changed to null, then lower=upper and one of the two conditions must be met.
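The procedure of the preceding paragraph can be sketched as follows, using the example query (A & B)|(C & D). This is an illustrative sketch only: the sentinel `HUH`, the function names, the use of `None` for null, and the 0/1 bound computation for this particular Boolean query are assumptions made for the example, with the threshold taken to be 0 for a Boolean score.

```python
HUH = "huh"  # sentinel for a parameter whose value at candidate is not yet known


def and_or_bounds(p):
    """(lower, upper) score bounds for the Boolean query (A & B) | (C & D).

    p maps each input name to HUH, None (null), or a non-null theta.
    """
    def lo(t):  # definitely present at candidate
        return p[t] is not None and p[t] != HUH

    def hi(t):  # possibly present at candidate
        return p[t] is not None

    lower = int((lo("A") and lo("B")) or (lo("C") and lo("D")))
    upper = int((hi("A") and hi("B")) or (hi("C") and hi("D")))
    return lower, upper


def find_minterm_input(score, params, order, threshold):
    """Set huh inputs to null one by one, in the given order, and return the
    first input that makes upper <= threshold or threshold < lower."""
    trial = dict(params)
    for t in order:
        trial[t] = None  # assume t does not occur at candidate
        lower, upper = score(trial)
        if upper <= threshold or threshold < lower:
            return t
    return None  # not reached when lower <= threshold < upper held initially


# State from the example: candidate = I_B.loc, so B is known and non-null,
# D lies beyond candidate (null at candidate), and A and C are still huh.
state = {"A": HUH, "B": 1, "C": HUH, "D": None}
chosen = find_minterm_input(and_or_bounds, state, ["C", "A"], 0)
```

In this state `chosen` is `"A"` even though C comes first in the ordering, which agrees with the observation that tossing C is fruitless until candidate reaches I_(D).loc( ).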

[0090] The minterm algorithm works with any ordering, but some orderings are better than others. In particular, we can order the inputs based upon our preference to toss them, based upon sparsity (given or learned), cost to toss, or some combination. In doing so, we toss the most preferred input from some minterm. Note that a more preferred input may be part of another minterm. However, every minterm will need to be handled before we can advance the candidate, unless we conclude that the candidate is in the result. For sparsity-ordered inputs, we choose the sparsest input from the densest minterm; this gives us the best chance to prove the candidate is in the result and advance the candidate. To minimize cost, the algorithm can be easily extended to find the least cost input that is part of some minterm (order by decreasing cost and find the complete minterm by restoring the last value to huh and continuing to set huh to null looking for each input that causes a bound to cross the threshold).

[0091] Hard-Coding nextCand( ): In some cases, it is possible to compute nextCand( ) for all possible θ_(t) 4 huh values. This computation further reduces repeatedly calling score.

[0092] Aggressive Heap Initialization. The efficiency of Algorithm A can be increased by using Aggressive Heap Initialization. In embodiments involving this optimization, dummy entries with high (but not too high) scores are inserted in the heap during initialization. This inflates the value of threshold, and consequently, fewer candidate values get examined. In order to illustrate this, refer to FIG. 3, and consider what will happen if the threshold value is increased. The danger incurred in using this optimization is that fewer than l documents may be returned. Depending on the query 8, such initialization can (but does not always) result in a performance enhancement. In this regard, consider FIG. 7.
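The effect of seeding the heap can be seen in a simplified sketch. The function `top_k`, its parameters, and the use of document id -1 for dummy entries are hypothetical; the sketch also assumes exact scores are already available, whereas Algorithm A compares score bounds against the threshold. It is meant only to show how dummy entries raise the threshold and how too high a seed can return fewer results than requested.

```python
import heapq


def top_k(scored_docs, k, seed_score=None):
    """Return (results, admitted): the top-k (score, doc_id) pairs and the
    number of documents that passed the threshold test."""
    heap = []  # min-heap; heap[0][0] is the threshold once the heap is full
    if seed_score is not None:
        # Dummy entries (doc id -1 is an assumption) inflate the initial threshold.
        heap = [(seed_score, -1)] * k
    admitted = 0
    for doc_id, score in scored_docs:
        if len(heap) == k and score <= heap[0][0]:
            continue  # cannot enter the heap
        admitted += 1
        if len(heap) == k:
            heapq.heapreplace(heap, (score, doc_id))
        else:
            heapq.heappush(heap, (score, doc_id))
    # Dummy entries are never returned as results.
    results = sorted((e for e in heap if e[1] != -1), reverse=True)
    return results, admitted


docs = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7), (5, 0.1)]
plain = top_k(docs, 2)                   # no seeding
seeded = top_k(docs, 2, seed_score=0.6)  # same results, fewer admissions
too_high = top_k(docs, 2, seed_score=0.8)  # only one real result survives
```

Here the unseeded run admits four documents, the run seeded at 0.6 admits two and returns the same two results, and the run seeded at 0.8 returns a single result, illustrating the danger noted above.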

[0093] FIG. 7 shows that Algorithm A benefits from aggressive threshold setting, as is expected. The figure shows four scoring functions 12: TFIDF, Lexical Affinities (LA), and their aggressively thresholded versions, TFIDF-t and LA-t. The bars chart the performance, in terms of the number of toss( ) operations, of Top-10 queries run on a single platform. The threshold for TFIDF-t and LA-t was set to be the best possible (i.e., the tenth highest score). The conjunctive and disjunctive Boolean queries in the last two columns provide a baseline for interpreting the main results. In queries Q3.2 and Q3.3, aggressive heap initialization helped. In query Q3.2, it helped TFIDF by a factor of 4. In query Q3.3, it helped both TFIDF and LA by a factor of 2.7 and 1.9 respectively. The queries used were proper name search queries of increasing complexity (i.e., they expanded initials, and looked for alternate forms (“Bob” OR “Robert”)).

[0094] Static Scores. Many search engines have a static component to the SCORE function. One example involves use of PageRank, which measures the desirability of a page independent of the query q. Algorithm A can be used to account for static scores as well. In one embodiment, a virtual token 13 (e.g., :static:) is added to each document 7, and θ_(:static:,d) is set to be the static score (predetermined score) for d. In another embodiment, the static score is included in every θ_(t,d). When the documents are ordered by decreasing static score in the index (i.e., the documents are considered by Algorithm A in decreasing static score), then the partial score function can use the static score of any document ≦ candidate as an upper bound for all locations ≧ candidate. Typically, this results in a decrease in the upper bounds of future documents. In particular, the upper bound when nothing is known about a document (i.e., score (huh, huh, . . . , huh) given the upper bound of static) generally decreases; when the upper bound of score with all huh values is below threshold, Algorithm A terminates early without considering any of the remaining documents.

[0095] Unsafe Approximations in upper. Algorithm A permits under-estimates when computing upper, which trades recall for performance. Typically, the closer the under-estimate, the better the recall. If upper is always an over-estimate, then A will find the exact result. Typically, the closer the over-estimate, the better the performance. Testing in this instance involved use of the Lexical Affinity SCORE function, which scores documents 7 based on the reciprocal of the distance between the query terms. The closer the terms are located within the document 7, the higher the score. A conservative value of upper was derived assuming that each wildcard query term occurs at a distance of one from every other query term. FIGS. 8-9 depict the effect of relaxing this assumption. The X-axis is the reciprocal of the assumed distance between query terms. Unsafe approximations can offer large gains in performance, without necessarily sacrificing significant quality of the results 9.

[0096] Referring to FIGS. 8-9, the two figures show the loss in recall and the gain in performance, respectively, when the implementation of score makes increasingly unsafe approximations of the LA (Lexical Affinity) function by assuming that the default reciprocal distance is the value on the X-axis.

[0097] Incremental Evaluation: Algorithm A treats evaluations of SCORE as cheap and consequently may use a number of evaluations. However, if SCORE is relatively expensive to evaluate, A may run into a computational (rather than an I/O) bottleneck. Preferably, incremental evaluation is used to at least partially address potential computational problems. In the Boolean context, this concern can be addressed by hard coding the evaluation of nextCand( ) and refine( ). When dealing with scoring functions, both nextCand( ) and refine( ) are not simply functions of the location order of the simple predicates, but depend in a non-trivial manner on the data associated with iterator locations. Accordingly, an object interface to scoring functions may be used, rather than a functional one. In this embodiment, the scoring object will maintain the state of each of the k input parameters as well as the current candidate location, loc. One example of interface methods to manipulate the state is provided as follows:

[0098] 1. set(i, x, θ) sets the state of parameter i ∈ [k] to (x, θ).

[0099] 2. setLoc(loc) sets the candidate location to loc.

[0100] 3. getScore( ) returns the score.

[0101] Incremental evaluation of SCORE can result in a more streamlined runtime at the cost of some additional programming work. Recalculating the SCORE dynamically when a variable is updated can be significantly cheaper than de novo evaluations of the score.
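An object interface of this kind can be sketched for a deliberately simple case. The class `IncrementalSumScore` is hypothetical, and it assumes a SCORE that is the plain sum of the θ values, which is chosen only because it makes the incremental update obvious; the x component of each parameter state is stored as an opaque tag. The method names follow the three interface methods listed above, adapted to Python naming.

```python
class IncrementalSumScore:
    """Hypothetical scoring object whose SCORE is the sum of the theta values."""

    def __init__(self, k):
        self.state = [(None, 0.0)] * k  # (x, theta) for each of the k parameters
        self.loc = None                 # current candidate location
        self.total = 0.0                # running score, maintained by set()

    def set(self, i, x, theta):
        # Adjust the running total by the change in parameter i only,
        # instead of re-summing all k parameters.
        self.total += theta - self.state[i][1]
        self.state[i] = (x, theta)

    def set_loc(self, loc):
        self.loc = loc

    def get_score(self):
        return self.total


scorer = IncrementalSumScore(3)
scorer.set_loc(7)
scorer.set(0, 7, 1.5)
scorer.set(2, 7, 2.0)
first = scorer.get_score()   # 3.5
scorer.set(0, 9, 0.5)        # only parameter 0 is re-accounted
second = scorer.get_score()  # 2.5
```

Each `set` call costs constant time regardless of k, which is the kind of saving over de novo evaluation that the paragraph above describes; a real scoring function would need its own incremental update rule.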

[0102] Combining Scores and Aggregation. Combining SCORE functions may be performed. For example, if the desired scoring metric is SCORE(d)=λ₁SCORE₁(d)+λ₂SCORE₂(d), then the implementations of score can be combined using the same ratio. Thus, upper=λ₁upper₁+λ₂upper₂ and likewise lower. Arguably, the most important combined score mixes a document's static score (e.g., PageRank) with a dynamic score (e.g., TFIDF or LA). FIG. 10 shows that Algorithm A behaves sublinearly when it combines scores. This is preferred over when the two score functions are computed independently and then added together.
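The combination of bounds can be written out directly. The helper `combine_bounds` is hypothetical, and the sketch assumes every weight λ is non-negative; with a negative weight the roles of lower and upper for that component would have to be swapped.

```python
def combine_bounds(bounds, weights):
    """Combine per-function (lower, upper) bounds with non-negative weights.

    bounds  -- list of (lower_i, upper_i) pairs, one per SCORE_i
    weights -- list of lambda_i values, assumed >= 0
    """
    if any(w < 0 for w in weights):
        raise ValueError("this sketch assumes non-negative weights")
    lower = sum(w * lo for w, (lo, _) in zip(weights, bounds))
    upper = sum(w * up for w, (_, up) in zip(weights, bounds))
    return lower, upper


# Example: a static component bounded by [0, 1] and a dynamic one by [2, 4].
combined = combine_bounds([(0.0, 1.0), (2.0, 4.0)], [0.5, 0.25])
```

For the example values, `combined` is `(0.5, 1.5)`, so a single threshold comparison on the combined range replaces separate evaluation of the two functions.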

[0103] As one may surmise, refining the score range for the candidate location may involve any one, or a combination of, techniques for advancing the input iterator I_(t). For example, in one embodiment, the input iterator I_(t) that is set to a location before the candidate location is advanced to the candidate location and the score range is reevaluated. In another embodiment, the input iterator I_(t) is randomly selected for advancement. In another embodiment, input iterators I_(t) are advanced in a round-robin fashion. In another embodiment, the input iterator that is the least expensive is advanced. In one embodiment, the sparsest input iterator I_(t) is advanced. One example of this embodiment calls for identifying the sparsest input iterator by measuring the effectiveness of the input iterator I_(t) to advance the candidate location. In this case, one measure includes measuring the effectiveness by dividing total progress attributed to the input iterator I_(t) by a number of toss operations for the input iterator I_(t).

[0104] In a further embodiment, refining the score range for the candidate location involves advancing a first iterator in an order of iterators set before the candidate location which is chosen such that when all iterators set after the chosen iterator are assumed to occur at the candidate location, all iterators before the chosen iterator are assumed to not occur at the candidate location, the upper bound is above the threshold when the chosen iterator is assumed to occur at the candidate location, and the upper bound is below the threshold when the chosen iterator is assumed to not occur at the candidate location.

[0105] Consider again the choice of index based document at a time (DAAT) strategies versus term at a time vector based (TAAT) strategies for implementing a query. The modern opinion is that for large data sets, the index based runtimes outperform the vector based runtimes. However, index based runtimes are hard to implement and each ranking engine is built as a specially engineered, custom system.

[0106] To address this issue, information retrieval engines are typically built using a two layer architecture. The scoring function 12 is approximated using a Boolean query. The lower stage performs index based retrieval based on the Boolean query (or near Boolean query—Boolean with a “near” operator) which is then passed downstream to the ranking stage for a complete evaluation. This strategy is a more efficient version of algorithm 1.2. The filtering predicate is more selective than a simple Boolean “OR.”

[0107] From a runtime optimization perspective, this architecture has two potential problems. First, there is the need for a layer that approximates the query in Boolean form. This can be a complicated or even an impossible process for black box scoring functions. The only viable option may be using a mostly ineffective filter like the Boolean “OR” used in the algorithm of Table 2. Second, even if effective approximations were possible, the resulting Boolean filter can be complicated and lead to daunting runtime optimization problems. As an example, a TFIDF-like threshold Boolean query requiring any 3 of 5 given terms has a Boolean DNF form involving $\binom{5}{3} = 10$ distinct disjuncts, each of size 3.
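The growth of the DNF form can be checked with a few lines of code. The function `threshold_to_dnf` and the term names are hypothetical and serve only to enumerate the disjuncts of an "any m of n" threshold query.

```python
from itertools import combinations
from math import comb


def threshold_to_dnf(terms, m):
    """Expand 'at least m of these terms' into DNF: one conjunction per m-subset."""
    return [tuple(subset) for subset in combinations(terms, m)]


terms = ["t1", "t2", "t3", "t4", "t5"]
disjuncts = threshold_to_dnf(terms, 3)

# Render the filter as text, e.g. "(t1 & t2 & t3) | (t1 & t2 & t4) | ..."
dnf_text = " | ".join("(" + " & ".join(d) + ")" for d in disjuncts)
```

For 3 of 5 terms this yields 10 conjunctions of 3 terms each; for 10 of 20 terms `comb(20, 10)` is 184756, which indicates how quickly such a Boolean filter becomes unwieldy.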

[0108] From a functional perspective, scoring functions 12 use more information than Boolean filters do. For instance, TFIDF requires an input of the frequency of a term within a document 7, which is more intensive than whether a term is present in the document 7. This information, available but not used in Boolean processing, represents a significant lost opportunity for runtime optimization. This opportunity is usually reclaimed by more special purpose code and compensation within the filtering phase.

[0109] The database point of view. The database equivalent of the information retrieval TAAT/DAAT question is the choice between bulk join methods such as Sort/Merge and Hash Join and small footprint, index aided joins such as Index Nested Loop joins. A further challenge is in handling XML documents. The challenge is twofold. First, current database community focus in XML retrieval largely concerns the Boolean domain, and does not consider information retrieval issues such as scoring concerns. Second, the retrievable entity is generally no longer a “document” but is arranged within a hierarchy.

[0110] Having described aspects of Algorithm A, one may recognize with reference to FIG. 11 that a system 100 suited for implementation of Algorithm A includes a processor 101, which is coupled to storage 102. Also coupled to the processor is at least one input and output (I/O) device 103, such as a keyboard for inputting a user query 8.

[0111] The storage 102 includes a Base Table 2, which is typically managed by a database manager 106. Also stored in storage 102 is the Algorithm A, 105, which draws upon scoring functions 12 as needed. The scoring functions 12 may include those discussed herein, such as Boolean functions or queries, and may include intermediate devices, such as posting lists. Operation of the Algorithm A 105 occurs by the operation of the processor 101, which queries the Base Table 2, to provide results 9. Other components, such as the parser 3, the token table 4 and the index table 6 are typically contained in storage 102.

[0112] This invention thus also pertains to a computer program product embodied on a computer readable medium, such as disk, tape and/or semiconductor or other memory. The computer program product includes computer instructions that, when executed by a data processor, such as the processor 101 of FIG. 11, result in the implementation of the Algorithm A, and methods as described above.

[0113] One skilled in the art will recognize that the invention disclosed herein is not limited to the embodiments set forth. More specifically, it is considered that the embodiment of Algorithm A, as well as the scoring functions discussed, are only illustrative of the invention herein, and are not limiting as other embodiments may be apparent to one skilled in the art.

What is claimed is:
 1. A computer program product embodied on a computer readable medium, the computer program product comprising computer instructions that implement a search algorithm comprising a function having an input for receiving, while there is at least one candidate location in an order of locations, a score range for the candidate location, the algorithm comparing the score range to a threshold within a range of possible scores, wherein if a lower bound of the score range for the candidate location exceeds the threshold then the candidate location is retained as a result and a next location is selected, and wherein if an upper bound of the score range is at or below the threshold the candidate location is discarded and the next location is selected, and wherein if the score of the candidate location is indeterminate, then the score range for the candidate location is refined.
 2. The computer program product as in claim 1, wherein the next location is selected by choosing the next location greater than the candidate location such that the upper bound exceeds the threshold.
 3. The computer program product as in claim 1, wherein the algorithm receives the score range from a blackbox scoring function.
 4. The computer program product as in claim 3, wherein the blackbox scoring function comprises a Boolean function.
 5. The computer program product as in claim 4, wherein the function comprises instructions for selecting the next location by choosing the next location greater than the candidate location such that the upper bound exceeds the threshold.
 6. The computer program product as in claim 3, wherein the blackbox scoring function comprises at least one of a Term Frequency Inverted Document Frequency (TFIDF) function, a Static Rank function, a Searching by Numbers function, a Lexical Affinities (LA) function, a Salience Levels (SL) function and a threshold predicate function.
 7. The computer program product as in claim 3, wherein the blackbox scoring function provides an under-estimate of the score range.
 8. The computer program product as in claim 1, wherein the score range for the candidate location is refined by advancing an input iterator that is set before the candidate location to the candidate location and reevaluating the score range.
 9. The computer program product as in claim 8, wherein advancing the input iterator comprises advancing a randomly selected input iterator.
 10. The computer program product as in claim 8, wherein advancing the input iterator comprises advancing the input iterator in a round robin fashion.
 11. The computer program as in claim 8, wherein advancing the input iterator comprises advancing the input iterator that is the least expensive to advance.
 12. The computer program product as in claim 8, wherein advancing the input iterator comprises advancing the sparsest input iterator.
 13. The computer program product as in claim 12, wherein advancing the sparsest input iterator comprises measuring the effectiveness of the input iterator to advance the candidate location.
 14. The computer program product as in claim 13, wherein measuring the effectiveness comprises dividing total progress attributed to the input iterator by a number of toss operations for the input iterator.
 15. The computer program product as in claim 8, wherein advancing the input iterator comprises selecting a first iterator in an order of iterators set before the candidate location such that when all iterators after the chosen iterator are assumed to occur at the candidate location, and all iterators set before the chosen iterator are assumed to not occur at the candidate location, and the upper bound is above the threshold when the chosen iterator is assumed to occur at the candidate location, and the upper bound is below the threshold when the chosen iterator is assumed to not occur at the candidate location.
 16. The computer program product as in claim 8, wherein advancing the input iterator comprises selecting a combination of techniques.
 17. The computer program product as in claim 1, wherein the candidate location comprises a predetermined score for the score range.
 18. A system for implementing a search algorithm, comprising: a processor for operating an algorithm that comprises an input for receiving from a blackbox scoring function a score for at least one candidate location in an order of locations, wherein the algorithm compares the score to a threshold, and if the score exceeds the threshold then the candidate location is stored as a result and a next location is selected, and if the score is at or below the threshold the candidate location is discarded and the next location is selected, and wherein if the score of the candidate location is indeterminate, then the score for the candidate location is refined; wherein each result is stored in a table of results ordered by relevance.
 19. A method for implementing a search of locations in a body of data for relevant terms, the method comprising: providing an index of locations comprised of terms, wherein a score for the relevant terms in a candidate location is provided by a scoring function and associated with the candidate location; and, while there are candidate locations: refining the score range if the score of the candidate location is indeterminate, otherwise, storing each candidate location as a result if a lower bound of the score range for the candidate location exceeds a threshold within a range of possible scores, discarding the candidate location if the score range is at or below an upper bound for the score range and selecting a next location. 